This document will examine and interpret what conclusions can be made using the “larger_sales_dataset.csv” obtained from Hassane Skikri through kaggle.com. For convenience sake, we will be referring to the business this dataset comes from as BusA Sales. The set consists of 10000 observations of 10 variables. Though the data in this set is fictional, it represents a realistic example of data that a business may collect during standard operation. Being able to clean, sort, and make sense of such a dataset is a skill that would be beneficial for the vast majority of businesses to employ.
df = read.csv("larger_sales_dataset.csv", sep = ",", header = T)
There are a total of 10 variables in the BusA data set. The fields
that we will be focusing on in this analysis will be:
Product Category, Total Price,
Order Date, Payment Type, and
Order Status.
summary(df)
Order.ID Product.ID Product.Category Quantity
Length:10000 Length:10000 Length:10000 Min. :1.00
Class :character Class :character Class :character 1st Qu.:2.00
Mode :character Mode :character Mode :character Median :3.00
Mean :3.01
3rd Qu.:4.00
Max. :5.00
Unit.Price Total.Price Order.Date Customer.ID
Min. : 10.07 Min. : 10.09 Length:10000 Length:10000
1st Qu.:129.21 1st Qu.: 284.39 Class :character Class :character
Median :251.67 Median : 602.82 Mode :character Mode :character
Mean :253.28 Mean : 762.72
3rd Qu.:378.26 3rd Qu.:1129.88
Max. :499.96 Max. :2499.78
Payment.Type Order.Status
Length:10000 Length:10000
Class :character Class :character
Mode :character Mode :character
Through the following commands, we are able to see that the data was taken over the span of a year from January 1st, 2023 until December 31st, 2023 as well as having 10000 unique customers (no repeated customers throughout dataset).
range(df$Order.Date)
[1] "2023-01-01" "2023-12-31"
length(unique(df$Customer.ID))
[1] 10000
In which category the most money is being made?
What time of year are sales peaking for each category? Overall?
What is the average total for orders? What does this mean regarding optimizing resource distribution?
What percentage of sales are refunded or cancelled?
When taking a look at the set, we can see that BusA Sales sells items from any of 6 categories.
Sports & Outdoors
Home & Kitchen
Beauty & Health
Books
Electronics
Clothing
Yearly.Total = c(sum(df$Total.Price[df$Product.Category == "Sports & Outdoors"]), sum(df$Total.Price[df$Product.Category == "Home & Kitchen"]), sum(df$Total.Price[df$Product.Category == "Beauty & Health"]), sum(df$Total.Price[df$Product.Category == "Books"]), sum(df$Total.Price[df$Product.Category == "Electronics"]), sum(df$Total.Price[df$Product.Category == "Clothing"]))
Yearly.Total <- currency(Yearly.Total, digits = 0L)
totals = data.frame(Product.Category = c("Sports & Outdoors", "Home & Kitchen", "Beauty & Health", "Books", "Electronics", "Clothing"), Yearly.Total)
arrange(totals, desc(Yearly.Total))
Product.Category Yearly.Total
1 Sports & Outdoors $1,313,735
2 Clothing $1,303,679
3 Electronics $1,290,283
4 Home & Kitchen $1,268,472
5 Books $1,249,307
6 Beauty & Health $1,201,765
max(Yearly.Total) - min(Yearly.Total)
[1] $111,970
salesGG = ggplot(df, aes(x=Product.Category))+geom_bar(fill = c("darkred", "seagreen", "orangered", "steelblue", "royalblue", "gold"))+ labs(x = "Product Category", y = "Number of Transactions", title = "Histogram of Transactions by Category")
ggplotly(salesGG)
Category Data
1 Most Popular Sports & Outdoors
2 Least Popular Beauty & Health
3 Range 105 Transactions
We are also able to see that the top 10 Biggest orders of the 2023 year averaged at Total Price of $2491.64. It is also worth noting that in this top ten, we see the category of Books 3 times, Beauty & Health and Electronics 2 times, and Home & Kitchen, Sports & Outdoors, and Clothing only once.
topTen = arrange(select(df, Total.Price, Product.Category), desc(Total.Price))
head(topTen, n = 10)
Total.Price Product.Category
1 2499.785 Home & Kitchen
2 2498.582 Beauty & Health
3 2494.533 Books
4 2494.450 Electronics
5 2492.415 Beauty & Health
6 2490.066 Sports & Outdoors
7 2488.129 Books
8 2487.236 Electronics
9 2485.949 Clothing
10 2485.286 Books
tenAvg = round(sum(topTen$Total.Price[1:10]) / 10, digits = 2)
tenAvg
[1] 2491.64
tail(topTen, n = 10)
Total.Price Product.Category
9991 13.00103 Home & Kitchen
9992 12.18653 Home & Kitchen
9993 12.15613 Sports & Outdoors
9994 12.11102 Sports & Outdoors
9995 11.27107 Beauty & Health
9996 11.11300 Sports & Outdoors
9997 11.05853 Sports & Outdoors
9998 10.80243 Home & Kitchen
9999 10.59864 Home & Kitchen
10000 10.09248 Clothing
tenAvg = round(sum(topTen$Total.Price[9991:10000]) / 10, digits = 2)
tenAvg
[1] 11.44
When breaking down the distribution of sales throughout the year, it immediately becomes clear that the data in this file is fictional. With a normal set of data with variables such as ours, it is typical to expect patterns and trends to appear at different times of the year that correlate with seasons, holidays, and other factors. However, when plotting the total price of each observation against the order date, we can see an extraordinarily consistent layout develop through the 10,000 cases. In order to make the plot more legible, we can sort by category. Doing so does not reduce the amount of clutter, however it does show that the distribution for each of the categories is remarkably similar to eachother with no real outliers becoming apparant.
In order to summarize the plot above in a way that is easier understood, I took the average total for orders for each month and plotted them. Doing so, we are able to see that there is a slight deviation in order totals with the smallest amount made per order during the middle of the year before recovering back into the later months. When looking at the graph, it is important that we keep the y-axis in mind. Though the difference in average order totals seems dramatic, looking at the same plot with the bottom limit of our y-axis dropped to 0, we can see that the difference between the months is quite insignificant.
##Subset with sorted by Order.Date
dff = select(df, Order.Date, Total.Price, Product.Category)
dff$Order.Date = sort(format(as.Date(df$Order.Date), "%m"), decreasing = F)
dff$Order.Date = as.numeric(dff$Order.Date)
##Create vector of month totals
geeba = c()
for (i in 1:12){
geeba = c(geeba, sum(dff$Total.Price[dff$Order.Date > (i - 1) & dff$Order.Date < (i + 1) ]))
geeba[i] = geeba[i] / length(dff$Total.Price[dff$Order.Date > (i - 1) & dff$Order.Date < (i + 1) ])
}
monthDf = data.frame(c(1:12), geeba)
g = ggplot(monthDf, aes(c(1:12), geeba)) + geom_point() + geom_smooth(method = "loess", se = F, formula = y~x, span = 1) + labs(title = "Order Total Averages By Month", x = "Month", y = "Average Totals")
g
g + ylim(0,800)
Due to BusA Sales a variety of categories from which they sell their products from, it would also be beneficial for executives to understand how each of their categories are trending as well. In the plot below, I have used the same format as the one above only this time splitting the data into six individual curves.
Another important part of this data set is the amount of refunded and cancelled orders. Should this be a real business, having a refund and cancel rate of nearly 25% each would be unimaginable. In the plot below, we can see that the refunded and cancelled orders are nearly level with those of the completed and pending orders.
methodGG = ggplot(df, aes(x=Order.Status, fill = Order.Status))+geom_bar() + labs(title = "Order Status Distribution", x = "Order Status", y = "Number of Transactions")
ggplotly(methodGG)
Stat Data
1 Most Frequent Status Pending
2 Least Frequent Status Refunded
3 Range 144 Orders
4 Refunded or Cancelled 49.68%
Finally, below we have the distribution of payment types that customers used during the 2023 fiscal year. Similar to the Order Status plot, we observe a similar amount of transactions between all 4 Payment Types with the most used being Paypal.
payGG = ggplot(df, aes(x=Payment.Type, fill = Payment.Type))+geom_bar() + labs(title = "Forms of Payment Distribution", x = "Payment Type", y = "Number of Transactions")
ggplotly(payGG)
Type Data
1 Most Used Paypal
2 Least Used Gift Card
3 Range 128 Users
In conclusion, the “larger_sales_dataset” by Hassane Skikri produced some interesting results upon data cleaning and wrangling. We were able to determine that during the fiscal year of 2023, BusA Sales had a total of 10,000 customers. These customers ordered the most from the categories of Sports & Outdoors and Clothing and the least from Beauty & Health. They had the highest order totals in the months of October and November, used Paypal the most, and had an almost 50% rate of either returning or cancelling an order.
While the fields that we observed were realistic, the data seen in this file is highly unlikely to come about during an actual business’ operations. In reality, it is likely that you would expect a higher variance in customer preferences (both categorical and chronological), distribution of sales, and a (hopefully) lower margin of refunded/cancelled orders. In addition to what we were given in the set, another field that could prove useful towards BusA’s endeavors could be a customer satisfaction survey of some sort.